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Abstract 

Tensor methods have emerged as a powerful paradigm for consistent learning of many latent variable 
models such as topic models, independent component analysis and dictionary learning. Model parameters 
are estimated via CP decomposition of the observed higher order input moments. However, in many 
domains, additional invariances such as shift invariances exist, enforced via models such as convolutional 
dictionary learning. In this paper, we develop novel tensor decomposition algorithms for parameter 
estimation of convolutional models. Our algorithm is based on the popular alternating least squares 
method, but with efficient projections onto the space of stacked circulant matrices. Our method is 
embarrassingly parallel and consists of simple operations such as fast Fourier transforms and matrix 
multiplications. Our algorithm converges to the dictionary much faster and more accurately compared 
to the alternating minimization over filters and activation maps. 

Keywords: Tensor CP decomposition, convolutional dictionary learning, convolutional ICA, blind de- 

convolution. 


1 Introduction 

The convolutional dictionary learning model posits that the input signal x is generated as a linear combination 
of convolutions of unknown dictionary elements or filters /i,... /£ and unknown activation maps : 

x=j2f:^w*, ( 1 ) 

ie[L] 


where \L\ := 1,... ,L. The vector w* denotes the activations at locations, where the corresponding filter 
/* is active. Convolutional models are ubiquitous in machine learning for image, speech and sentence 
representations [71II311I], and in neuroscience for modeling neural spike trains [101118) . Deep convolutional 
neural networks are a multi-layer extension of these models with non-linear activations. Such models have 
revolutionized performance in image, speech and natural language processing 

In order to learn the model in o, usually a square loss reconstruction criterion is employed: 


min 

/i.U'ijl/ilHl 


\\x- 

ie[L] 


( 2 ) 


The constraints (||/i|| = 1) are enforced, since otherwise, the scaling can be exchanged between the filters fi 
and the activation maps Wi. Also, an additional regularization term is usually added to the above objective, 
e.g. to promote sparsity on wt, an £i term is added. 

A popular heuristic for solving m is based on alternating minimization [8], where the filters fi are 
optimized, while keeping the activations Wi fixed, and vice versa. Each alternating update can be solved 
efficiently (since it is linear in each of the variables). However, the method is computationally expensive in 
the large sample setting since each iteration requires a pass over all the samples, and in modern machine 
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learning applications, the number of samples can run into billions. Moreover, the alternating minimization 
has multiple local optima and reading the global optimum of ([2]) is NP-hard in general. This problem is 
severely amplified in the convolutional setting due to additional symmetries, compared to the usual dictionary 
learning setting (without the convolutional operation). Due to shift invariance of the convolutional operator, 
shifting a filter fi by some amount, and applying a corresponding negative shift on the activation Wi leaves 
the objective in Q unchanged. Thus, solving ([2|) is fundamentally ill-posed and has a large number of 
equivalent solutions. On the other hand, imposing shift-invariance constraints directly on the objective 
function in Q results in non-smooth optimization which is challenging to solve. 

Can we design methods that efficiently incorporate the shift invariance constraints into the learning 
problem? Can they break the symmetry between equivalent solutions? Are they scalable to huge datasets? 
In this paper, we provide positive answers to these questions. We propose a novel framework for learning 
convolutional models through tensor decomposition. We consider inverse method of moments to estimate the 
model parameters via decomposition of higher order (third or fourth order) input cumulant tensors. When 
the inputs x are generated from a convolutional model in with independent activation maps w*, i.e. a 
convolutional ICA model, we show that the moment tensors have a CP decomposition, whose components 
form a stacked circulant matrix. We propose a novel method for tensor decomposition when such circulant 
(i.e. shift invariance) constraints are imposed. 

Our method is a constrained form of the popular alternating least squares (ALS) method for tensor 
decompositior0. We show that the resulting optimization problem (in each ALS step) can be solved in 
closed form, and uses simple operations such as Fast Fourier transforms (FFT) and matrix multiplications. 
These operations have a high degree of parallelism: for estimating L filters, each of length n, we require 
0{\ogn -I- logL) time and 0{L^n^) processors. Thus, our method is embarrassingly parallel and scalable 
to huge datasets. We carefully optimize computation and memory costs by exploiting tensor algebra and 
circulant structure. We implicitly carry out many of the operations and do not form large (circulant) matrices 
to minimize storage requirements. 

Moreover, our method requires only one pass over data to compute the higher order cumulant of the input 
data or its approximation through sketching algorithms. This is a huge saving in running time compared 
to the alternating minimization method in Q which requires a pass over data in each step. Decoding all 
the activation maps Wi in each step of ([2]) is hugely expensive, and our method avoids it by estimating only 
the filters fi in the learning step. In other words, the activation maps wfs are averaged out in the input 
cumulant. After filter estimation, the activation maps are easily estimated using Q in one data pass. 

Experiments further demonstrate superiority of our method compared to alternating minimization. Our 
algorithm converges accurately and much faster to the true underlying filters compared to alternating mini¬ 
mization. Our algorithm is also orders of magnitude faster than the alternating minimization. 

1.1 Related Works 

The special case of o with one filter (L = 1) is a well studied problem, and is referred to as blind deconvo¬ 
lution [13j . In general, this problem is not identifiable, i.e. multiple equivalent solutions can exist [9]. It has 
been documented that in many cases alternating minimization produces trivial solutions, where the filter 
f = X is the signal itself and the activation is the identity function m- Therefore, alternative techniques have 
been proposed, such as convex programs, based on nuclear norm minimization and imposing hierarchical 
Bayesian priors for activation maps [2^. However, there is no analysis for settings with more than one filter. 
Incorporating Bayesian priors has shown to reduce the number of local optima, but not completely eliminate 
them [161120) . Moreover, Bayesian techniques are in general more expensive than alternating minimization 
in (HI). 

The extension of blind deconvolution to multiple filters is known as convolutive blind source separation or 
convolutive independent component analysis (ICA) [13) . Previous methods directly reformulate convolutive 
ICA as an ICA model, without incorporating the shift constraints. Moreover, reformulation leads to an 

^The ALS method for tensor decomposition is not to be confused with the alternating minimization method for solving 
While l[2]l acts on data samples, ALS operates on averaged moment tensors. 
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increase of number of hidden sources from L to nL in the new model, where n is the input dimension, 
which are harder to separate and computationally more expensive. Complicated interpolation methods m 
overcome these indeterminacies. In contrast, our method avoids all these issues. We do not perform Fourier 
transform on the input, instead, we employ FFTs at different iterations of our method to estimate the filters 
efficiently. 

The dictionary learning problem without convolution has received much attention. Recent results show 
that simple iterative methods can learn the globally optimal solution m- In addition, tensor decomposition 
methods provably learn the model, when the activations are independently drawn (the ICA model) [3] or are 
sparse (the sparse coding model) [3]. In this work, we extend the tensor decomposition methods to efficiently 
incorporate the shift invariance constraints imposed by the convolution operator. 


2 Model and Formulation 

Notation: Let [n] := {l,2,...,n}. For a vector v, denote the T'' element as v{i). For a matrix 

M, denote the T** row as M® and column as Mj. For a tensor T € its (ji,* 2 ,* 3 )**® entry 

is denoted by A column-stacked matrix M consisting of M/s (with same number of rows) is 

M := [Ml, M 2 , ■. ■, Ml\. Similarly, a row-stacked matrix M from M^s (with same number of columns) is 
M:= [Mi; M 2 ;...; Mi]. 

Cyclic Convolution: The 1-dimensional (1-D) n-cyclic convolution / ^ ic between vectors / and w is 

defined as v = f v(i) = f — j -f 1) mod n). On the other hand, linear convolution is the 

combination without the modulo operation (i.e. cyclic shifts) above. n-Cyclic convolution is equivalent to 
linear convolution, when n is at least twice the support length of both / and w |19j , which will be assumed. 
We drop the notation n in for convenience. Cyclic convolution in ([3]) is equivalent to f = Cir(/) • w, 
and 

Cir(/) := ^/(p)Gp G (Gp)] := 6 {{{i - j) mod n) = p - 1} , Vp G [nj. (3) 

p 

defines a circulant matrix. A circulant matrix Cir(/) is characterized by the vector /, and each column 
corresponds to a cyclic shift of /. 

Properties of circulant matrices: Let F be the discrete Fourier transform matrix whose (m, fc)-th 

entry is F™ = Vm, k G [n] where a;„ = exp(—2^). If U := y/nF~^, U is the set of eigenvectors 

for all n X n circulant matrices [12]. Let the Discrete Fourier Transform of a vector / be FFT(/), we express 
the circulant matrix Cir(/) as 

Cir(/) = t7diag(F • f)U^ = ;7diag(FFT(/))[/^ (4) 

This is an important property we use in algorithm optimization to improve computational efficiency. 

Column stacked circulant matrices: We will extensively use column stacked circulant matrices F := 
[Cir(/i),..., Cir(/i)], where C\r{fj) is the circulant matrix corresponding to filter fj. 

2.1 Convolutional Dictionary Learning/ICA Model 

We assume that the input a: G R" is generated as 

x= f*^w* = C\r{f*)w* =F* -w*, (5) 

i6[L] jG[L] 

where F* := [Cir(/[^), Cir(/|),..., Cir(/£)] is the concatenation or column stacked version of circulant ma¬ 
trices and w* is the row-stacked vector w* := G R"^. Recall that Cir(/*) is circulant matrix 


3 




(b)Reformulated model 


Figure 1: Convolutional tensor decomposition for learning convolutional ICA models, (a) The convolutional 
generative model with 2 filters, (b) Reformulated model where J-* is column-stacked circulant matrix. 



Figure 2: Convolutional tensor decomposition for learning convolutional ICA models. The third order 
cumulant is decomposed as filters. 


corresponding to filter /*, as given by Note that although J-* is a n by nL matrix, there are only nL 
free parameters. We never explicitly form the estimates T of but instead use filter estimates /j’s to 
characterize T. In addition, we can handle additive Gaussian noise in ®, but do not incorporate it for 
simplicity. 

Activation Maps: For each observed sample cc, the activation map w* in ([S]) indicates the locations 

where each filter /* is active and w* is the row-stacked vector w* := ■ ■ - w^]. We assume that 

the coordinates of w* are drawn from some product distribution, i.e. different entries are independent of 
one another and we have the independent component analysis (ICA) model in ([S]). When the distribution 
encourages sparsity, e.g. Bernoulli-Gaussian, only a small subset of locations are active, and we have the 
sparse coding model in that case. We can also extend to dependent distributions such as Dirichlet for w*, 
along the lines of [B], but limit ourselves to ICA model for simplicity. 

Learning Problem: Given access to N i.i.d. samples, X := [a;^, ..., x'^] G K"^^, generated according 

to the above model, we aim to estimate the true filters /*, for i G \L]. Once the filters are estimated, we 
can use standard decoding techniques, such as the square loss criterion in (I2|) to learn the activation maps 
for the individual maps. We focus on developing novel methods for filter estimation in this paper. 


3 Form of Cumulant Moment Tensors 

Tensor Preliminaries We consider 3rd order tensors in this paper but the analysis is easily extended to 
higher order tensors. For tensor T G entry is denoted by G [n],i 2 G 

[n ],*3 G [n]. A flattening or unfolding of tensor T G R is the column-stacked matrix of all its slices, given 
by unfold(T) := [T]:^.^ 2 , ■ ■ ■, G Define the Khatri-Rao product for vectors u G R“ and 

u G as a row-stacked vector [u 0 v] := [^(1)^; u{2)v; ...; u(a)u] G Khatri-Rao product is also defined 
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for matrices with same columns. For M € and M' G MqM' := [MiQM{, ..., Mc0M', ] G 
where Mi denotes the i*'' column of M. 


Cumulant: The third order cumulant of a multivariate distribution is a third order tensor, which uses 

(raw) moments up to third order. Let C 3 G denote the unfolded version of third order cumulant 

tensor, it is given by 

C 3 := E[a;(x 0 — unfold(Z) ( 6 ) 

where [Z]a,b,c ■= E[xa]E[xbXc] + E[xb]E[xaXc] + E[a:c]E[a;aa;b] - 2E[a:a]E[xb]E[a;c], Va, &, c G [n]. 

Under the convolution ICA model in Section [2T] we show that the third order cumulant has a nice tensor 
form, as given below. 

Lemma 3.1 (Form of Cumulants). The unfolded third order cumulant C 3 in ([HI) has the following decom¬ 
position form 

C3= E \*T*{T*QT*y , where h* (7) 

jg[nL] 

where T* denotes the j"* column of the column-stacked circulant matrix T* and X* is the third order cumulant 
corresponding to the (univariate) distribution ofw*{j). 

For example, if the Z*'' activation is drawn from a Poisson distribution with mean A, we have that A* = A. 
Note that if the third order cumulants of the activations, i.e. A*’s, are zero, we need to consider higher order 
cumulants. This holds for zero-mean activations and we need to use fourth order cumulant instead. Our 
methods extend in a straightforward manner for higher order cumulants. 




blki(J') 


blki(J-) 


Figure 3: Blocks of the column-stacked circulant matrix T. 

The decomposition form in ([7]) is known as the Candecomp/Parafac (CP) decomposition form [3] (the 
usual form has the decomposition of the tensor and not its unfolding, as above). We now attempt to recover 
the unknown filters /* through decomposition of the third order cumulants C 3 . This is formally stated 
below. 

Objective Function: Our goal is to obtain filter estimates /i’s which minimize the Frobenius norm || • Hf 
of reconstruction of the cumulant tensor C 3 , 

rmn IIC 3 - (J -0 J-)T|| 2 , 

s.t. blk^(J■) = t/diag(FFT(/^))t/^ ||/i ||2 = l, VZg[L], A = diag(A). (8) 

where blki(J^) denotes the Z*** circulant matrix in T. The conditions in (|S]) enforce blk/(J^) to be circulant 
and for the filters to be normalized. Recall that U denotes the eigenvectors for circulant matrices. The rest 
of the paper is devoted to devising efficient methods to solve (|S]). 

Throughout the paper, we will use to denote the column of if, and blk/(J^) to denote the Z*** 
circulant matrix block in T. Note that F G Fj G K” and blk/(J^) G 

4 Alternating Least Squares for Convolutional Tensor Decompo¬ 
sition 

To solve the non-convex optimization problem in ([5]), we consider the alternating least squares (ALS) method 
with column stacked circulant constraint. We first consider the asymmetric relaxation of ([ 8 ]) and introduce 
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separate variables T^ Q and 'H. for filter estimates along each of the modes to fit the third order cumulant 
tensor C 3 . We then perform alternating updates by fixing two of the modes and updating the third one. For 
instance, 

mm ||C3-J-A(i«©a)^||^s.t. blk,(J-) = C/ • diag(FFT(/i)) • C/^ II//II 2 = G [L] (9) 

Similarly, Q and 'H. have the same column-stacked circulant matrix constraint and are updated similarly in 
alternating steps. The diagonal matrix A is updated through normalization. 

We now introduce the Convolutional Tensor (CT) Decomposition algorithm to efficiently solve © in 
closed form, using simple operations such as matrix multiplications and fast Fourier Transform (FFT). We 
do not form matrices iF,G and TL G which are large, but only update them using filter estimates 

fit • • ■) fLt 9it ■ • ■ t 9Lt hit ■ • • hij. 

Using the property of least squares, the optimization problem in ® is equivalent to 

mm||C 3 ((?^ © - T\\l s.t. blk,(.F) = U ■ diag(FFT(/,)) • C/^ ||/z||^ = 1,V1 G [L] (10) 

when (T-L © Q) and A are full column rank. The full rank condition requires nL < 01 L < n, and it is a 

reasonable assumption since otherwise the filter estimates are redundant. In practice, we can additionally 
regularize the update to ensure full rank condition is met. Denote 

M:=C3((•H©e)^)^ ( 11 ) 

where f denotes pseudoinverse. Let blk/(M) and blki(A) denote the blocks of M and A.Since (fTOl) has 
block constraints, it can be broken down in to solving L independent sub-problems 

mm||blki(M).blki(A)t-U-diag(FFT(/z))-C/^||p s.t. ||/i||i = 1,VI G [L] (12) 


We present the main result now. 

Theorem 4.1 (Closed form updates). The optimal solution for (TT^ is given by 

blki(M),||-Ublki(M)!..C 


fr\p) = 


E 

z,je[n] 


^p-1 


E n-i 


Vp G [n],q := [i — j) mod n. 


(13) 


z,je[ra] 


Further A = diag(A) is updated as X{i) = || ||, for all i G [nL]. 

Thus, the reformulated problem in (TT^ can be solved in closed form efficiently. A bulk of the compu¬ 
tational effort will go into computing M in (HU. Computation of M requires 2L fast Fourier Transforms of 
length n filters and simple matrix multiplications without explicitly forming Q or 'H. We make this con¬ 
crete in the next section. The closed form update after getting M is highly parallel. With 0(n^A/log n) 
processors, it takes O(logn) time. 


5 Algorithm Optimization for Reducing Memory and Computa¬ 
tional Costs 

We now focus on estimating M := © f/)^)^ in (HU. If done naively, this requires inverting x nL 

matrix and multiplication of n x and n^ x nL matrices with 0{n^) time. However, forming and computing 
with these matrices is very expensive when n (and L) are large. Instead, we utilize the properties of circulant 
matrices and the Khatri-Rao product © to efficiently carry out these computations implicitly. We present 
our final result on computational complexity of the proposed method. 

Lemma 5.1 (Computational Complexity). With multi-threading, the running time is 0(logn-F logL) 
per iteration using 0{L^n^) processes. 

We now describe how we utilize various algebraic structures to obtain efficient computation. 
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Property 1 (Khatri-Rao product): {{'HQ G)^)^ = {H Q G){{H^H).* {G^ G))^, where .* denotes element¬ 
wise product. 

Computational Goals: Find {{H^ H). * {G^ G))^ first and multiply the result with Cz{TLqG) to find M. 
We now describe in detail how to carry out each of these steps. 

5.1 Challenge: Computing 

A naive implementation to find the matrix inversion {{H^H). * {G^G))^ is very expensive. However, we in¬ 
corporate the stacked circulant structure of G and H to reduce computation. Note that this is not completely 
straightforward since although G and H are column stacked circulant matrices, the resulting product whose 
inverse is required, is not circulant. Below, we show that however, it is partially circulant along different 
rows and columns. 

Property 2 (Block circulant matrix): The matrix {H^H). * {G^G) consists of row and column stacked 
circulant matrices. 

We now make the above property precise by introducing some new notation. Define column stacked 
identity matrix I := [/,...,/] € where / is n x n identity matrix. Let U := Blkdiag(C/, U,.. .U) € 

j^nLxnL block diagonal matrix with U along the diagonal. The first thing to note is that G and H, 

which are column stacked circulant matrices, can be written as 

e = I•U•diag(^;)•U^ u:=[FFT(5i);FFT(g2);...;FFT(gi)], (14) 

where gi, ■ ■ - Ql are the filters corresponding to G, and similarly for H, where the diagonal matrix consists 
of FFT coefficients of the respective filters hi,... ,hL- 


blk}(^) 


blki('F) 




blkf(’4') 


blki('F) 


Figure 4: Blocks of the row-and-column-stacked diagonal matrices ’4'. blk* (^) is diagonal. 

By appealing to the above form, we have the following result. We use the notation blk)(4t) for a matrix 
\I> £ denote (*,j)“* block of size n x n. 

Lemma 5.2 (Form of {H^H). * {G^G) )• We have 

{{n^n).*{G^G)y = (15) 

where ^ L by L blocks, each block of size n x n. Its {j,iy’' block is given by 

blk^('4') = diag'^(FFT(g,)) • diag'^(FFT(/i,)) • diag(FFT(gO) • diag(FFT(/iO) e (16) 

Therefore, the inversion of ifH^H). * {G^G) can be reduced to the inversion of row-and-column stacked 
set of diagonal matrices which form \F. Computing 4/ simply requires FFT on all 2L filters gi,... ,gL and 
hi, ... ,hL, i.e. 2L FFTs, each on length n vector. We propose an efficient iterative algorithm to compute 
via block matrix inversion theorem m in Appendix |B] 
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5.2 Challenge: Computing M = C^{V. © Q) ■ {{'hC'V.). * Q))"^ 

Now that we have computed * {G^Q)y efficiently, we need to compute the resulting matrix with 

C 3 {'H 0 G) to obtain M. We observe that the m*** row of the result M is given by 

M™= ^ U^diag*^ (z)$('")diag(u) Vm G [nL], (17) 

jelnL] 

where v := [FFT(gi);...; FFT( 5 i)], z := [FFT(/ii); ...; FFT(hL)] are concatenated FFT coefficients of the 
filters, and 


.= u^I^rMlU, [rW]j := [C3]((to-i)„, Vz,j,mG[n] (18) 

Note that and F^"*) are fixed for all iterations and needs to be computed only once. Note that 
is the result of taking row of the cumulant unfolding C 3 and matricizing it. Equation m uses the 
property that 0 G) is equal to the diagonal elments of 'HJ F^™) G- 

We now bound the cost for computing (flTl) . (1) Inverting takes 0(log L+log n) time with 0(n^L^/(log n+ 
logL)) processors according to appendix [BJ (2) Since diag(t!) and diag(z) are diagonal and \I> is a matrix 
with diagonal blocks, the overall matrix multiplication in equation m takes 0{L^n^) time serially with 
0{LF"n?) degree of parallelism for each row. Therefore the overall serial computation cost is 0{Lp"n^) with 
0{L^n^) degree of parallelism. With multi-threading, the running time is 0(1) per iteration using 0{Lp"n^) 
processes. (3) FFT requires 0(n log n) serial time, with 0(n) degree of parallelism. Therefore computing 2L 
FFT’s takes O(logn) time with 0{Ln) processors. 

Combining the above discussion, it takes 0(logL + logn) time with 0{LF"n?) processors. 


6 Experiments: Compare with Alternating Minimization 



Figure 5: Error comparison (on filters and reconstruct tensor) between our convolutional tensor method 
(proposed CT) and the baseline alternate minimization method (baseline AM). 
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Figure 6: (a) Running time comparison between our proposed CT method and the baseline AM method 

under varying L. (b) Running time comparison between our proposed CT method and the baseline AM 
method under varying N. 


We compare our convolutional tensor decomposition framework with alternating (between filters and 
activation map) minimization method in equation ([2]) where gradient descent is employed to update fi and 
Wi alternatively. 

The error comparison between our proposed convolutional tensor algorithm and the alternating minimiza¬ 
tion algorithm is in figure [S] We evaluate the errors for both algorithms by comparing the reconstruction 
of error and filter recovery error. Our algorithm converges much faster to the solution than the alternating 
minimization algorithm. In fact, the alternating minimization leads to spurious solution where the recon¬ 
struction error decreases but filter estimation error increases. Note that in experiments, we observe a more 
robust performance if we do deflation, meaning that we recover the filters one by one. The error bump in 
the reconstruction error curve in figure [5] is due to the random initialization. 

The running time is also reported in figure [S] between our proposed convolutional tensor algorithm and 
the alternating minimization. Our algorithm is orders of magnitude faster than the alternating minimization. 
Both our algorithm and alternating minimization scale linearly with number of filters. However convolutional 
tensor algorithm scales constantly with the number of samples whereas the alternating minimization scales 
linearly. 

7 Discussion 

There are number of questions for future investigation. We plan to extend learning sentence embeddings to 
handle more challenging tasks involving long documents such as plagiarism detection and reading compre¬ 
hension since our framework is scalable. Since our framework is parallelizable, it can handle long vectors and 
learn in an efficient manner. Our framework easily extends to convolutional models for higher dimensional 
signals, where the circulant matrix is replaced with block circulant matrices |12j . More generally, our frame¬ 
work can in principle handle general group structure which are diagonalizable. Deploying the frameworks 
in settings which incorporate Lie algebra is of great practical interest in computer vision and robotics. By 
combining the advantages of tensor methods with a general class of invariant representations, we expect to 
have a powerful paradigm for learning efficient embeddings. 
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Appendix for Convolutional Dictionary Learning 
through Tensor Factorization 


A Cumulant Form 

In [3], it is proved that in ICA model, the cumulant of observation x is decomposed into multi-linear transform 


of a diagonal cumulant of h. Therefore, we aim to find the third order cumulant for input x. 

As we know that the order moments for variable x is defined as 

■■= IE[a;n e K"X"X" (19) 

Let us use to denote the {i,j,ky^ entry of the third order moment. The relationship between 3*** 

order cumulant Kaand 3“' order moment 

Therefore the shift tensor is in this format: We know that the shift term 

[Z]a,b,c'-=^[x''cX&[xlx'’^+¥\xh\^[xax'"^+W^Xc]^[xaXb]-‘2E[xo\^[xb]^[xc], a,b,c€ [n] (21) 

It is known from [3] that cumulant decomposition in the 3 order tensor format is 

E[x(^x(^x]-Z = ^ \*F*®F*®T* (22) 

lG[nL] 

Therefore using the Khatri-Rao product property, 

unfoid( ^ x*F* ® T* ® j-;) = ^ a; j';(j'; 0 [t* 0(23) 

je[nL] j&[nL] 


Therefore the unfolded third order cumulant is decomposed as C 3 = J^*A* {JP* Q F*)^. 


B Parallel Inversion of ^ 

We propose an efficient iterative algorithm to compute via block matrix inversion theorem [In- 

Lemma B.l. (Parallel Inversion of row and eolumn stacked diagonal matrix) Let = '^ be partitioned 
into a block form: 

^ O 

R biki('^) 


= 


(24) 


where O := 


blkK'®') 


, and R := 

blk^-i(’®') J 

0 ( 1 ) time using 0 (n) processors, there inverse 0 /’®' is achieved by 


blk]^_ 2 (’®'), -.. ,blk^„;^(’4') . After inverting blk^('5') which takes 


(J^-i-Oblk^('5') ^R)-i ^0(blk^(’®')-i?(J^-i) ^0)-i 

-blk^(’®')"^i?(J^-i - Oblk^(’5')“^i?)-i (blki(’4') - R{J^-^)~^O)-^ 


[ -bik):('®-) ^i?( 

assuming that and blk|^ ^ are invertible. 


(25) 
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This again requires inverting R, O and . Recursively applying these block matrix inversion theorem, 
the inversion problem is reduced to inverting number of n by n diagonal matrices with additional matrix 
multiplications as indicated in equation (1251) . 

Inverting a diagonal matrix results in another diagonal one, and the complexity of inverting n x n 
diagonal matrix is 0(1) with 0(n) processors. We can simultaneous invert all blocks. Therefore with 
0{nL^) processors, we invert all the diagonal matrices in 0(1) time. The recursion takes L steps, for step 
i G [L] matrix multiplication cost is O(lognL) with 0{n?L/\og{nL)) processors. With L iteration, one 
achieves 0(logn + logL) running time with 0(n^L^/(logL + logn)) processors. 
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